---
title: Index Creation
---

## Overview

A BM25 index must be created over table's columns before they can be searched.
This index is strongly consistent, which means that new data is immediately searchable across all connections.
Once an index is created, it automatically stays in sync with the underlying table as the data changes.

## Basic Usage

The following code block creates an index called `search_idx` over the `description` and `rating` columns of the `mock_items`
table, which was created in the [quickstart](/search/quickstart#full-text-search).

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description'),
  numeric_fields => paradedb.field('rating')
);
```

<ParamField body="index_name" required>
  The name of the index. The index name can be anything, as long as doesn't
  conflict with an existing index or schema. A new schema with associated query
  functions will be created with this name.
</ParamField>
<ParamField body="table_name" required>
  The name of the table being indexed.
</ParamField>
<ParamField body="key_field" required>
  The name of a column in the table that represents a unique identifier for each
  record. Usually, this is the same column that is the primary key of the table.
  In version 0.7.5 and earlier, only integer IDs were supported. From version
  0.7.6 onwards, [non-integer IDs are also
  supported.](https://github.com/paradedb/paradedb/pull/1174).
</ParamField>
<ParamField body="schema_name" default="CURRENT SCHEMA">
  The name of the schema, or namespace, of the table.
</ParamField>

This example query will create a schema called `search_idx`, which contains a `search` function.

```sql
SELECT * FROM search_idx.search('description:keyboard');
```

## Field Types

### Text Fields

Columns of type `VARCHAR`, `TEXT`, `UUID`, and their corresponding array types can be indexed as text fields.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', indexed => true)
);
```

`paradedb.field` accepts the following configuration options for text fields:

<ParamField body="indexed" default={true}>
  Whether the field is indexed. Must be `true` in order for the field to be
  tokenized and searchable.
</ParamField>
<ParamField body="stored" default={true}>
  Whether the original value of the field is stored.
</ParamField>
<ParamField body="fast" default={false}>
  Fast fields can be random-accessed rapidly. Fields used for aggregation must
  have `fast` set to `true`. Fast fields are also useful for accelerated scoring
  and filtering.
</ParamField>
<ParamField body="fieldnorms" default={true}>
  Fieldnorms store information about the length of the text field. Must be
  `true` to calculate the BM25 score.
</ParamField>
<ParamField body="tokenizer" default="default">
  A `JSONB` produced by `paradedb.tokenizer` which specifies the tokenizer and
  tokenizer configuration options. See [tokenizers](#tokenizers) for a list of
  available tokenizers.
</ParamField>
<ParamField body="record" default="position">
  Describes the amount of information indexed. See [record](#record) for a list
  of available record types.
</ParamField>
<ParamField body="normalizer" default="raw">
  The name of the tokenizer used for fast fields. This field is ignored unless
  `fast=true`. See [normalizers](#normalizers) for a list of available
  normalizers.
</ParamField>

### Numeric Fields

Columns of type `SMALLINT`, `INTEGER`, `BIGINT`, `OID`, `REAL`, `DOUBLE PRECISION`, `NUMERIC`, and their corresponding array types can be indexed
as `numeric_fields`. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', indexed => true)
);
```

`paradedb.field` accepts the following configuration options numeric fields:

<ParamField body="indexed" default={true}>
  Whether the field is indexed. Must be `true` in order for the field to be
  tokenized and searchable.
</ParamField>
<ParamField body="stored" default={true}>
  Whether the original value of the field is stored.
</ParamField>
<ParamField body="fast" default={true}>
  Fast fields can be random-accessed rapidly. Fields used for aggregation must
  have `fast` set to `true`. Fast fields are also useful for accelerated scoring
  and filtering.
</ParamField>

### Boolean Fields

Columns of type `BOOLEAN` and `BOOLEAN[]` can be indexed as `boolean_fields`. Indexing a boolean field is useful if
it is used for filtering as part of the search query.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  boolean_fields => paradedb.field('in_stock')
);
```

`paradedb.field` accepts several configuration options for boolean fields:

<ParamField body="indexed" default={true}>
  Whether the field is indexed. Must be `true` in order for the field to be
  tokenized and searchable.
</ParamField>
<ParamField body="stored" default={true}>
  Whether the original value of the field is stored.
</ParamField>
<ParamField body="fast" default={true}>
  Fast fields can be random-accessed rapidly. Fields used for aggregation must
  have `fast` set to `true`. Fast fields are also useful for accelerated scoring
  and filtering.
</ParamField>

### JSON Fields

Columns of type `JSON` and `JSONB` can be indexed as `json_fields`. Once indexed, search can be
performed on nested text fields within JSON values.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  json_fields => paradedb.field('metadata')
);
```

`paradedb.field` accepts several configuration options for JSON fields:

<ParamField body="indexed" default={true}>
  Whether the field is indexed. Must be `true` in order for the field to be tokenized and
  searchable.
</ParamField>
<ParamField body="stored" default={true}>
  Whether the original value of the field is stored.
</ParamField>
<ParamField body="fast" default={false}>
  Fast fields can be random-accessed rapidly. Fields used for aggregation must have `fast` set to `true`.
  Fast fields are also useful for accelerated scoring and filtering.
</ParamField>
<ParamField body="expand_dots" default={true}>
  If `true`, JSON keys containing a `.` will be expanded. For instance, if `expand_dots` is `true`,
  `{"metadata.color": "red"}` will be indexed as if it was `{"metadata": {"color": "red"}}`.
</ParamField>
<ParamField body="tokenizer" default="default">
  The name of the tokenizer. See [tokenizers](#tokenizers) for a list of available tokenizers.
</ParamField>
<ParamField body="record" default="position">
  Describes the amount of information indexed. See [record](#record) for a list of available
  record types.
</ParamField>
<ParamField body="normalizer" default="raw">
  The name of the tokenizer used for fast fields. This field is ignored unless `fast=true`. See
  [normalizers](#normalizers) for a list of available normalizers.
</ParamField>

### Datetime Fields

Columns of type `DATE`, `TIMESTAMP`, `TIMESTAMPTZ`, `TIME`, `TIMETZ`, and their corresponding array types can be indexed as `datetime_fields`.
Indexing a datetime field is useful if it is used for filtering as part of the search query.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  datetime_fields => paradedb.field('created_at')
);
```

`paradedb.field` accepts several configuration options for boolean fields:

<ParamField body="indexed" default={true}>
  Whether the field is indexed. Must be `true` in order for the field to be
  tokenized and searchable.
</ParamField>
<ParamField body="stored" default={true}>
  Whether the original value of the field is stored.
</ParamField>
<ParamField body="fast" default={true}>
  Fast fields can be random-accessed rapidly. Fields used for aggregation must
  have `fast` set to `true`. Fast fields are also useful for accelerated scoring
  and filtering.
</ParamField>

## Multiple Fields

The `||` operator can be used to index multiple fields.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category')
);
```

## Configuration Options

### Tokenizers

Tokenizers are responsible for breaking down text fields into smaller, searchable components called tokens.
Once a field is tokenized, all search queries against that field are automatically tokenized the same way.

The following code block demonstrates the syntax for specifying a tokenizer for a text field.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    name => 'description',
    tokenizer => paradedb.tokenizer('en_stem')
  )
);
```

Some tokenizers like the `stem` and `ngram` tokenizer accept configuration options. They can be passed to the
`paradedb.tokenizer` function.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    name => 'description',
    tokenizer => paradedb.tokenizer('stem', language => 'French')
  )
);
```

The following tokenizer names are accepted by the `paradedb.tokenizer` function.

<ParamField body="default">
  Tokenizes the text by splitting on whitespace and punctuation, filters out tokens that
  are larger than 255 bytes, and converts to lowercase.
</ParamField>
<ParamField body="whitespace">
  Tokenizes the text by splitting on whitespace only, filters out tokens that
  are larger than 255 bytes, and converts to lowercase.
</ParamField>
<ParamField body="raw">
  Does not process nor tokenize text. Filters out tokens larger than 255 bytes.
</ParamField>
<ParamField body="lowercase">
  Equivalent to `raw`, but also converts to lowercase.
</ParamField>
<ParamField body="regex">
  Tokenizes text using a regular expression. The regular expression can be specified with the `pattern` parameter.
  <Expandable title="Config Options">
    <ParamField body="pattern">
      The regular expression pattern used to tokenize the text. Example: `\\W+` splits on non-word characters.
    </ParamField>
  </Expandable>
</ParamField>
<ParamField body="ngram">
  Tokenizes text by splitting words into overlapping substrings based on the specified parameters.
  <Expandable title="Config Options">
    <ParamField body="min_gram">
     Defines the minimum length for the n-grams. For instance, if set to 2, the smallest token created would be of length 2 characters.
    </ParamField>
    <ParamField body="max_gram">
    Determines the maximum length of the n-grams. If set to 5, the largest token produced would be of length 5 characters.
    </ParamField>
    <ParamField body="prefix_only">
    When set to true, the tokenizer generates n-grams that start from the beginning of the word only, ensuring a prefix progression. If false, n-grams are created from all possible character combinations within the `min_gram` and `max_gram` range.
    </ParamField>
  </Expandable>
</ParamField>
<ParamField body="chinese_compatible">
  Tokenizes text considering Chinese character nuances. Splits based on whitespace and punctuation. Filters out tokens larger than 255 bytes.
</ParamField>
<ParamField body="chinese_lindera">
  Tokenizes text using the Lindera tokenizer, which uses the CC-CEDICT dictionary to segment and tokenize text.
</ParamField>
<ParamField body="korean_lindera">
  Tokenizes text using the Lindera tokenizer, which uses the KoDic dictionary to segment and tokenize text.
</ParamField>
<ParamField body="japanese_lindera">
  Tokenizes text using the Lindera tokenizer, which uses the IPADIC dictionary to segment and tokenize text.
</ParamField>
<ParamField body="icu">
  Tokenizes text using the ICU tokenizer, which uses Unicode Text Segmentation and is suitable for tokenizing most
  languages.
</ParamField>
<ParamField body="stem">
  Applies multi-language stemming to tokens, filters out tokens larger than 255 bytes, and converts to lowercase. The language can be specified with the
  `language` parameter.

{" "}

<Expandable title="Config Options">
  <ParamField body="language">
    Options are `Arabic`, `Danish`, `Dutch`, `English`, `Finnish`, `French`,
    `German`, `Greek`, `Hungarian`, `Italian`, `Norwegian`, `Portuguese`,
    `Romanian`, `Russian`, `Spanish`, `Swedish`, `Tamil`, and `Turkish`.
  </ParamField>
</Expandable>

</ParamField>
<ParamField body="en_stem" deprecated={true}>
  Equivalent to `stem` with language set to `English`.
</ParamField>

### Fast Fields

A field that is indexed as `fast` is stored in a column-oriented fashion. Fast fields are necessary for
[aggregations/faceting](/search/full-text/aggregations). They can also improve the query times of [filters](/search/full-text/bm25#efficient-filtering)
and BM25 scoring.

The following code block demonstrates how to specify a fast field.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', fast => true)
);
```

### Normalizers

Normalizers specify how text and JSON fast fields should be processed. This option is ignored over any
other type of field.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('category', fast => true, normalizer => 'raw')
);
```

<ParamField body="raw">
  Does not process nor tokenize text. Filters out tokens larger than 255 bytes.
</ParamField>
<ParamField body="lowercase">
  Applies a lowercase transformation on the text. Filters token larger than 255
  bytes.
</ParamField>

### Record

The `record` option specifies how much information is recorded with an indexed field.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', record => 'position')
);
```

<ParamField body="basic">Records only the document IDs.</ParamField>
<ParamField body="freq">
  Records the document IDs as well as term frequency. This is useful for BM25
  scoring.
</ParamField>
<ParamField body="position">
  Records the document ID, term frequency and positions of occurrences. Required
  to run a phrase query.
</ParamField>

## Deleting a BM25 Index

The following command deletes a BM25 index, as well as its associated schema and query functions:

```sql
CALL paradedb.drop_bm25(index_name => '<index_name>');
```

<Accordion title="Example Usage">

```sql
CALL paradedb.drop_bm25(
  index_name => 'search_idx',
  schema_name => 'public'
);
```

</Accordion>

<ParamField body="index_name" required>
  The name of the index you wish to delete.
</ParamField>
<ParamField body="schema_name" default="CURRENT SCHEMA">
  The name of the schema that the index was created in.
</ParamField>

## Inspecting a BM25 Index

The `schema` function returns a table with information about the index schema.

```sql
SELECT * FROM <index_name>.schema();
```

<Accordion title="Example Usage">

```sql
SELECT * FROM search_idx.schema();
```

</Accordion>

<ParamField body="index_name" required>
  The name of the index.
</ParamField>

## Partial BM25 Index

The following code block demonstrates how to pass predicates to `create_bm25`
to construct a [partial index](https://www.postgresql.org/docs/current/indexes-partial.html). Partial
indexes are useful for reducing index size on disk and improving update speeds over non-indexed rows.

```sql
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description'),
  predicates => 'category = ''Electronics'' AND rating > 2'
);
```

## Legacy Syntax

The `paradedb.field` and `paradedb.tokenizer` functions were introduced in `0.8.6`. These functions are
syntactic sugar for generating `JSONB`. Users that prefer the old syntax can still pass `jsonb` into `_fields`.

```sql
-- These two queries are equivalent
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => '{"description": {"tokenizer": {"type": "en_stem"}}}'
);

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    'description',
    tokenizer => paradedb.tokenizer('en_stem')
  )
);
```
